A new differential LSI space-based probabilistic document classifier
Abstract
We have developed a new, effective probabilistic classifier for document classification by introducing the concept of differential document vectors and DLSI (differential latent semantic indexing) spaces. Combined use of the projections onto, and the distances to, the DLSI spaces derived from the differential document vectors improves the adaptability of the LSI (latent semantic indexing) method by capturing unique characteristics of documents. Using intra- and extra-document statistics, both a simple posterior calculation on a small example and an experiment on the large Reuters-21578 database demonstrate the advantage of the DLSI space-based probabilistic classifier over the LSI space-based classifier in classification performance. © 2003 Elsevier B.V. All rights reserved.
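The two geometric quantities named in the abstract, the projection onto a DLSI space and the distance to it, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the toy term counts, the choice of cluster centroids as reference points for the differential vectors, the rank `k=2`, and the helper names `unit_rows`, `dlsi_basis`, and `projection_and_distance`. The probability function that the paper builds from these quantities is not reproduced.

```python
import numpy as np

def unit_rows(m):
    """Scale each row (one document vector) to unit Euclidean length."""
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    return m / np.where(norms == 0, 1.0, norms)

def dlsi_basis(diff_vectors, k):
    """Top-k left singular vectors (terms x k) of the differential
    term-by-document matrix whose columns are the given vectors."""
    d = np.asarray(diff_vectors, dtype=float).T  # terms x documents
    u, _, _ = np.linalg.svd(d, full_matrices=False)
    return u[:, :k]

def projection_and_distance(x, basis):
    """Length of the projection of x onto span(basis) and the
    distance from x to that space."""
    coords = basis.T @ x
    proj = float(np.linalg.norm(coords))
    dist = float(np.sqrt(max(float(x @ x) - proj ** 2, 0.0)))
    return proj, dist

# Hypothetical toy data: 4 documents over 5 terms, in two clusters.
docs = unit_rows(np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 1, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 0, 1, 2],
], dtype=float))
labels = np.array([0, 0, 1, 1])
centroids = unit_rows(
    np.array([docs[labels == c].mean(axis=0) for c in (0, 1)])
)

# Assumed construction: intra = document minus its own cluster centroid,
# extra = document minus the other cluster's centroid.
intra = docs - centroids[labels]
extra = docs - centroids[1 - labels]

intra_basis = dlsi_basis(intra, k=2)
extra_basis = dlsi_basis(extra, k=2)

# Differential vectors of a new document against each centroid.
query = unit_rows(np.array([[1, 1, 1, 0, 0]], dtype=float))[0]
for c in (0, 1):
    x = query - centroids[c]
    print(c,
          projection_and_distance(x, intra_basis),
          projection_and_distance(x, extra_basis))
```

A classifier in this spirit would turn each (projection, distance) pair into a likelihood under the intra- and extra-document models and compare the resulting posteriors across clusters; that step is left out because the abstract does not specify it.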
Similar resources
A differential LSI method for document classification
We have developed an effective probabilistic classifier for document classification by introducing the concept of the differential document vectors and DLSI (differential latent semantics index) spaces. A simple posteriori calculation using the intra- and extra-document statistics demonstrates the advantage of the DLSI space-based probabilistic classifier over the popularly used LSI space-based c...
Spam Filtering using Contextual Network Graphs
This document describes a machine-learning solution to the spam-filtering problem. Spam-filtering is treated as a text-classification problem in very high dimension space. Two new text-classification algorithms, Latent Semantic Indexing (LSI) and Contextual Network Graphs (CNG) are compared to existing Bayesian techniques by monitoring their ability to process and correctly classify a series of...
Latent Semantic Indexing Based on Factor Analysis
The main purpose of this paper is to propose a novel latent semantic indexing (LSI), statistical approach to simultaneously mapping documents and terms into a latent semantic space. This approach can index documents more effectively than the vector space model (VSM). Latent semantic indexing (LSI), which is based on singular value decomposition (SVD), and probabilistic latent semantic indexing ...
A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier
With the fast increase of the documents, using Text Document Classification (TDC) methods has become a crucial matter. This paper presented a hybrid model of Invasive Weed Optimization (IWO) and Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the big size of features space in TDC. TDC includes different actions such as text processing, feature extraction, form...
Improved Skips for Faster Postings List Intersection
Information retrieval can be achieved through computerized processes by generating a list of relevant responses to a query. The document processor, matching function and query analyzer are the main components of an information retrieval system. Document retrieval system is fundamentally based on: Boolean, vector-space, probabilistic, and language models. In this paper, a new methodology for mat...
Journal:
- Inf. Process. Lett.
Volume: 88
Publication date: 2003